7 research outputs found

    Evaluating the impact of Permanent Faults in a {GPU} running a Deep Neural Network

    Get PDF
    Currently, Deep Neural Networks (DNNs) are fundamental computational structures deployed in a wide range of modern application domains (e.g., data analysis, healthcare, automotive, robotics). The computational complexity inherent in these cognitive models demands high-performance devices like Graphics Processing Units (GPUs). Therefore, the implementation of DNNs on GPU devices is becoming increasingly frequent, even for cutting-edge safety-critical applications (e.g., autonomous and semi-autonomous cars). Thus, the reliability evaluation of these applications is mandatory, because several phenomena (including aging) may produce permanent defects in the GPU, inducing the DNN to produce wrong results. Until now, the effects of permanent faults on DNNs have mainly been investigated at the application level only, e.g., by acting on the parameters of the network. This paper presents an environment that allows, for the first time, a more detailed experimental evaluation of the impact of permanent faults in a GPU on the reliability of a DNN running on it, by considering faults at the architectural level. The results of the fault injection campaigns we performed on the GPU register files are compared with those at the application level, proving that the latter are generally optimistic.
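The architectural-level fault model this abstract refers to can be illustrated with a minimal sketch (not the paper's actual tool; the function name and the 32-bit register width are assumptions for illustration): a permanent stuck-at fault forces one bit of a register to a fixed value on every access, unlike a transient bit flip.

```python
# Illustrative permanent stuck-at fault model on a 32-bit GPU register.
# A stuck-at-1 fault makes the faulty bit read as 1 on every access;
# a stuck-at-0 fault makes it read as 0.

def read_with_stuck_at(value: int, bit: int, stuck_to: int) -> int:
    """Return `value` as observed through a permanent stuck-at fault on `bit`."""
    mask = 1 << bit
    if stuck_to == 1:
        return value | mask                # bit always reads as 1
    return value & ~mask & 0xFFFFFFFF      # bit always reads as 0

# Example: a register holding 0b1010 with bit 0 stuck at 1 reads as 0b1011.
faulty = read_with_stuck_at(0b1010, bit=0, stuck_to=1)
```

Because the fault is permanent, every instruction that reads the register sees the corrupted value, which is what distinguishes this model from a single-event upset.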

    Exploring Hardware Fault Impacts on Different Real Number Representations of the Structural Resilience of TCUs in GPUs

    Get PDF
    The most recent generations of graphics processing units (GPUs) boost the execution of convolutional operations required by machine learning applications by resorting to specialized and efficient in-chip accelerators (Tensor Core Units or TCUs) that operate on matrix multiplication tiles. Unfortunately, modern cutting-edge semiconductor technologies are increasingly prone to hardware defects, and the trend to highly stress TCUs during the execution of safety-critical and high-performance computing (HPC) applications increases the likelihood of TCUs producing different kinds of failures. In fact, the intrinsic resilience to hardware faults of arithmetic units plays a crucial role in safety-critical applications using GPUs (e.g., in automotive, space, and autonomous robotics). Recently, new arithmetic formats have been proposed, particularly those suited to neural network execution. However, the reliability characterization of TCUs supporting different arithmetic formats was still lacking. In this work, we quantitatively assessed the impact of hardware faults in TCU structures while employing two distinct formats (floating-point and posit) and using two different configurations (16 and 32 bits) to represent real numbers. For the experimental evaluation, we resorted to an architectural description of a TCU core (PyOpenTCU) and performed 120 fault simulation campaigns, injecting around 200,000 faults per campaign and requiring around 32 days of computation. Our results demonstrate that TCUs using the posit format are less affected by faults than those using the floating-point one (by up to three orders of magnitude for 16 bits and up to twenty orders for 32 bits). We also identified the most sensitive fault locations (i.e., those that produce the largest errors), thus paving the way toward adopting smart hardening solutions.
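The claim that some fault locations produce far larger errors than others can be sketched for the floating-point side (posit arithmetic is not in the Python standard library, so only IEEE-754 float32 is shown; the function names are illustrative): sticking a bit in the exponent field distorts the value enormously, while the same fault in a low mantissa bit barely changes it.

```python
import struct

def float_to_bits(x: float) -> int:
    """Reinterpret a float as its 32-bit IEEE-754 encoding."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(b: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE-754 float."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def stuck_at_error(x: float, bit: int) -> float:
    """Absolute error when `bit` of x's float32 encoding is stuck at 1."""
    faulty = bits_to_float(float_to_bits(x) | (1 << bit))
    return abs(faulty - x)

# For x = 1.0: sticking the top exponent bit (bit 30) saturates the exponent
# and yields infinity, while sticking mantissa bit 0 adds only ~1.2e-7.
huge = stuck_at_error(1.0, 30)
tiny = stuck_at_error(1.0, 0)
```

This bit-position sensitivity is exactly why a structural characterization per fault location, as the abstract describes, is more informative than an aggregate error rate.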

    Effective fault simulation of GPU’s permanent faults for reliability estimation of CNNs

    No full text
    Convolutional Neural Networks (CNNs) and Graphics Processing Units (GPUs) are now increasingly adopted in many cutting-edge safety-critical applications. Consequently, it is crucial to evaluate the reliability of these systems, since the hardware can be affected by several phenomena (e.g., wear-out of the device) producing permanent defects in the GPU. These defects may induce wrong outcomes in the CNN that may endanger the application. Traditionally, the study of the effects of permanent faults on CNNs has been approached by resorting to application-level fault injection (e.g., acting on the weights). However, this approach has restricted scope, and it may not reveal the actual vulnerabilities in the GPU device. Hence, a more accurate evaluation of the fault effects is required, considering more in-depth details of the device's hardware. This work introduces a more detailed experimental evaluation of the impact of a GPU's permanent faults on the reliability of a CNN by resorting to a Software-Implemented Fault Injection (SWIFI) strategy, considering faults at the hardware level. The results of the fault simulation campaigns we performed on the GPU data-path cores are compared with those at the application level, proving that the latter are generally optimistic.
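The SWIFI idea of emulating a permanent data-path fault in software can be sketched as follows (a minimal illustration under assumed names, not the paper's injector): a faulty multiplier corrupts every product it produces, and the result of a small dot product is compared against the fault-free "golden" run to detect Silent Data Corruption (SDC).

```python
# Emulate a permanent stuck-at-1 fault on one output bit of a multiplier,
# then compare a faulty dot product against the golden (fault-free) result.

def faulty_mul(a: int, b: int, stuck_bit: int = 2) -> int:
    return (a * b) | (1 << stuck_bit)   # output bit `stuck_bit` always reads 1

def dot(xs, ys, mul):
    return sum(mul(x, y) for x, y in zip(xs, ys))

golden = dot([1, 2, 3], [4, 5, 6], lambda a, b: a * b)  # fault-free reference
faulty = dot([1, 2, 3], [4, 5, 6], faulty_mul)
sdc = faulty != golden   # Silent Data Corruption: outputs differ, no crash
```

Because the fault is applied at the operation level rather than to the weights, it models a defective functional unit corrupting every computation routed through it, which application-level injection cannot capture.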

    Reliability Assessment of Neural Networks in GPUs: A Framework For Permanent Faults Injections

    Full text link
    Currently, Deep Learning and especially Convolutional Neural Networks (CNNs) have become a fundamental computational approach applied in a wide range of domains, including some safety-critical applications (e.g., automotive, robotics, and healthcare equipment). Therefore, the reliability evaluation of those computational systems is mandatory. The reliability evaluation of CNNs is performed by fault injection campaigns at different levels of abstraction, from the application level down to the hardware level. Many works have focused on evaluating the reliability of neural networks in the presence of transient faults. However, the effects of permanent faults have been investigated at the application level only, e.g., by targeting the parameters of the network. This paper proposes a framework resorting to a binary instrumentation tool to perform fault injection campaigns targeting different components inside the GPU, such as the register files and the functional units. This environment allows, for the first time, assessing the reliability of CNNs deployed on a GPU in the presence of permanent faults. Comment: Paper accepted for presentation at the 2022 IEEE International Symposium on Industrial Electronics, Anchorage, Alaska, USA, June 1-3, 2022; to be published on IEEE Xplore after the event.

    A Reliability-aware Environment for Design Exploration for GPU Devices

    No full text
    Nowadays, GPU platforms have gained wide importance in applications that require high processing power. Unfortunately, the advanced semiconductor technologies used for their manufacturing are prone to different types of faults. Hence, solutions are required to support the exploration of the resilience to faults of different architectures. Based on this motivation, this work presents an environment dedicated to the analysis of the impact of permanent faults on GPU platforms. This environment is based on GPGPU-Sim, with the objective of exploiting the configuration features of this tool and, thus, analyzing the effects of faults when changing the target architecture. To validate the environment and show its usability, a fault campaign has been carried out in which three different GPU architectures (Kepler, Volta, and Turing) were used. In addition, each GPU has been configured with an arbitrary number of parallel processing cores (or SMs). Three representative applications (Vector Add, Scalar Product, and Matrix Multiply) were executed on each GPU, and the behavior of each architecture in the presence of permanent faults in the functional units (i.e., the integer and floating-point units) was analyzed. This fault campaign shows the usability of the environment and demonstrates its potential use to support decisions on the best architectural parameters for a given application.
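The bookkeeping of such a campaign (applications x architectures x fault sites, each faulty run compared against its golden run) can be sketched in a few lines. This is a standalone illustration with made-up names; the paper drives GPGPU-Sim, whereas `run` here is only a deterministic stand-in for a simulator invocation.

```python
from itertools import product

def run(app, arch, fault=None):
    # Stand-in for a simulator run; in this toy model, only a fault in the
    # "fpu" site corrupts the application's output.
    base = hash((app, arch)) & 0xFFFF
    return base ^ 0x1 if fault == "fpu" else base

def campaign(apps, archs, fault_sites):
    """Tally, per (app, arch, site), whether the faulty output differs."""
    results = {}
    for app, arch, site in product(apps, archs, fault_sites):
        golden = run(app, arch)
        faulty = run(app, arch, fault=site)
        results[(app, arch, site)] = faulty != golden
    return results

res = campaign(["vector_add"], ["Kepler", "Volta", "Turing"], ["fpu", "int"])
```

Sweeping the architecture axis while holding the application fixed is what lets the environment rank architectural parameters by resilience, as the abstract describes.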

    Analyzing the Impact of Different Real Number Formats on the Structural Reliability of TCUs in GPUs

    No full text
    Modern Graphics Processing Units (GPUs) boost the execution of tiled matrix multiplications by extensively using in-chip accelerators (Tensor Core Units or TCUs). Unfortunately, cutting-edge semiconductor technologies are increasingly prone to hardware defects. Indeed, faults may affect TCUs when processing massive amounts of data under classical floating-point formats, raising reliability concerns when used in the safety-critical and High-Performance Computing (HPC) domains. In this scenario, the characterization of faulty TCUs supporting different arithmetic formats was still missing. This work quantitatively evaluates, for the first time, the effects of hardware faults arising in TCU structures when using two different formats for real number representation (i.e., Floating-Point and Posit). For the experimental evaluation, we resort to an architectural description of a TCU core (PyOpenTCU) and perform 60 fault simulation campaigns, injecting 57,344 faults per campaign and requiring around 24 days of computation. The experimental results indicate a relation between the corrupted spatial areas in the output matrices and the TCU's scheduling policies. Moreover, the numeric analysis shows that hardware faults in TCUs in most cases affect up to 2 bits in the output results for both considered formats. The results also demonstrate that the Posit formats are less affected by faults than Floating-Point formats by up to one order of magnitude.
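The "up to 2 bits affected" finding comes from the kind of numeric analysis sketched below (illustrative code with assumed names, not the paper's scripts): compare each element of the faulty output matrix against the golden one and count how many bits of the float32 encoding differ.

```python
import struct

def bits32(x: float) -> int:
    """32-bit IEEE-754 encoding of a float."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def corrupted_bits(golden: float, faulty: float) -> int:
    """Number of differing bits between the two encodings (Hamming distance)."""
    return bin(bits32(golden) ^ bits32(faulty)).count("1")

golden_row = [1.0, 2.0, 3.0]
faulty_row = [1.0, 2.0, 3.0000002]   # one element corrupted in its last bit
per_element = [corrupted_bits(g, f) for g, f in zip(golden_row, faulty_row)]
```

Mapping these per-element counts back onto their positions in the output tile is also what exposes the relation between corrupted spatial areas and the TCU's scheduling policy mentioned in the abstract.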